
What is TCRdist?

TCRdist (Dash et al., Nature 2017) quantifies the similarity between two T-cell receptors based on the concordance of their amino acid sequences in regions important for antigen recognition. The algorithm computes a weighted Hamming distance between two TCRs, using a BLOSUM62 substitution matrix to penalize amino acid mismatches between V/J segments and CDR3 loops.

Figure: A schematic of TCRdist.
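
To make the idea of a substitution-weighted Hamming distance concrete, here is a minimal Python sketch. It is not the TCRdist scoring scheme itself: the `TOY_PENALTY` table, the `DEFAULT_PENALTY` value, and the `weight` argument are illustrative assumptions rather than BLOSUM62-derived values, and the sketch only handles sequences of equal length.

```python
# Illustrative sketch only: the penalty values below are made up and are
# NOT the BLOSUM62-derived values used by TCRdist.
TOY_PENALTY = {
    frozenset("AS"): 1,  # hypothetical: similar residues, small penalty
    frozenset("LI"): 1,
    frozenset("DE"): 2,
}
DEFAULT_PENALTY = 4  # hypothetical penalty for any other mismatch


def position_penalty(a: str, b: str) -> int:
    """Return 0 for identical residues, otherwise a substitution penalty."""
    if a == b:
        return 0
    return TOY_PENALTY.get(frozenset((a, b)), DEFAULT_PENALTY)


def weighted_hamming(seq1: str, seq2: str, weight: int = 1) -> int:
    """Sum per-position penalties for two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("this sketch only handles equal-length sequences")
    return weight * sum(position_penalty(a, b) for a, b in zip(seq1, seq2))


# Two toy CDR3-like sequences that differ at two positions (A/S and L/W).
d = weighted_hamming("CASSLG", "CSSSWG", weight=3)
print(d)  # (1 + 4) * 3 = 15
```

A full distance between two TCRs would add up such terms over each compared region of both chains, with region-specific weights.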

Why do we need another TCRdist implementation?

TCRdist was originally implemented in Python (tcr-dist), but computation becomes very slow when calculating pairwise TCRdist values for thousands of TCRs. Efforts to speed up TCRdist calculation in Python have been made using numba, a high-performance just-in-time (JIT) compiler (tcrdist3), Cython, which compiles Python code into fast C code (fast_tcrdist), and standard C++ (CoNGA). However, none of these implementations make use of GPUs, which can outperform compiled C++ executed on CPUs.

We wrote a GPU-enabled TCRdist implementation in Python that is faster than all other existing implementations (although other implementations are certainly more feature-rich). To enable use with extra-large datasets of millions of TCRs, it batches computation, keeps results in sparse format, and can progressively write results to a text file rather than returning a cumbersomely large data object.
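
The batching strategy described above can be sketched in plain Python. This is a simplified illustration, not the package's code: `chunk_ranges`, `write_sparse_distances`, and `toy_dist` are hypothetical names, the distance is a toy mismatch count, and the real implementation uses vectorized array operations rather than Python loops. The point is that only pairs at or below the cutoff are kept (a sparse result) and each chunk's hits are written out before the next chunk is processed.

```python
import io
from typing import Callable, Iterator, Sequence, TextIO, Tuple


def chunk_ranges(n: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) index ranges that cover 0..n in chunks."""
    for start in range(0, n, chunk_size):
        yield start, min(start + chunk_size, n)


def write_sparse_distances(
    items: Sequence[str],
    dist: Callable[[str, str], int],
    cutoff: int,
    chunk_size: int,
    out: TextIO,
) -> int:
    """Write 'i<TAB>j<TAB>distance' for each pair i < j with distance <= cutoff.

    Returns the number of pairs written.
    """
    n = len(items)
    n_written = 0
    for start, stop in chunk_ranges(n, chunk_size):
        # Only the hits for the current chunk are held in memory.
        hits = []
        for i in range(start, stop):
            for j in range(i + 1, n):
                d = dist(items[i], items[j])
                if d <= cutoff:
                    hits.append((i, j, d))
        # Flush this chunk's hits to the output before moving on.
        out.writelines(f"{i}\t{j}\t{d}\n" for i, j, d in hits)
        n_written += len(hits)
    return n_written


def toy_dist(a: str, b: str) -> int:
    """Toy distance: mismatches plus length difference (not TCRdist)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


seqs = ["CASSL", "CASSI", "CATSL", "CWWWW"]
buf = io.StringIO()  # stands in for an open text file
n_pairs = write_sparse_distances(seqs, toy_dist, cutoff=1, chunk_size=2, out=buf)
print(n_pairs)
print(buf.getvalue(), end="")
```

In practice `out` would be a file opened for writing, so that memory use depends on the chunk size rather than on the total number of TCR pairs.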

Our implementation is compatible with both NVIDIA and Apple Silicon GPUs and automatically detects a user’s GPU type. It uses cupy (for NVIDIA GPUs) or mlx (for Apple Silicon GPUs) to compute pairwise TCRdist for each batch of TCRs, using numpy as a backup when no GPU is available.
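
The fallback order described above (cupy, then mlx, then numpy) can be illustrated with a short Python sketch. It is an assumption-laden simplification: `pick_backend` and `module_installed` are hypothetical helpers, and they only check whether a module is installed, not whether a matching GPU is physically present, which the actual package also checks.

```python
from importlib.util import find_spec
from typing import Callable, Optional, Sequence

# Preference order taken from the text: NVIDIA (cupy), Apple Silicon (mlx), CPU (numpy).
BACKEND_PREFERENCE = ("cupy", "mlx", "numpy")


def module_installed(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return find_spec(name) is not None


def pick_backend(
    preference: Sequence[str] = BACKEND_PREFERENCE,
    is_installed: Callable[[str], bool] = module_installed,
) -> Optional[str]:
    """Return the first installed module in `preference`, or None if none is found."""
    for name in preference:
        if is_installed(name):
            return name
    return None


# Prints whichever of the three modules is found first on this machine
# (or None if none of them is installed).
print(pick_backend())
```

Passing a custom `is_installed` function makes the selection logic easy to exercise without any of the three libraries being present.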

The TCRdist() function in TIRTLtools is a wrapper that allows us to run this fast GPU implementation via R using the reticulate package.

Notes on our implementation

  • Your input data frame must include the following columns (it may include additional columns):
    • va - V-segment for alpha chain
    • cdr3a - CDR3 amino acid sequence for alpha chain
    • vb - V-segment for beta chain
    • cdr3b - CDR3 amino acid sequence for beta chain
  • Our implementation currently ignores J-segments and calculates similarity based on only V-segment and CDR3 amino acid sequence.
  • Our implementation currently requires V-segments and CDR3 sequences for both alpha and beta chains. If you have single-chain data only, you can insert dummy values in the V and CDR3 columns for the missing chain to allow for compatibility with our function.
  • Our implementation uses a pre-calculated substitution matrix for amino acids and V-segments. TCRs whose amino acid sequences contain stop codons (*) or frameshifts (_) and TCRs with V-segments that are not found in the substitution matrix will be dropped. For a list of permitted amino acids and V-segments, see TIRTLtools::params$feature.
  • Our V-segments include an allele identifier, e.g. “TRAV13-1*03”. If some or all of your V-segments do not include alleles, the function will automatically add “*01” to them.
  • You may also run TCRdist directly in Python (https://github.com/NicholasClark/TCRdist_gpu). However, the steps above of preparing the data (dropping improper TCRs and adding alleles) are not part of the Python function, so you may want to run prep_for_tcrdist() on your data first and write that data frame to a file.
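
For readers working directly in Python, the preparation rules listed above can be sketched as follows. This is not the code of prep_for_tcrdist(); `prep_rows` and its helpers are hypothetical, and `PERMITTED_V` is a tiny stand-in for the real list of permitted V-segments (see TIRTLtools::params$feature).

```python
from typing import Dict, List

REQUIRED_COLUMNS = ("va", "cdr3a", "vb", "cdr3b")

# Hypothetical stand-in: the real permitted list is much longer.
PERMITTED_V = {"TRAV13-1*03", "TRAV1-2*01", "TRBV19*01"}


def add_default_allele(v_segment: str) -> str:
    """Append '*01' when the V-segment has no allele identifier."""
    return v_segment if "*" in v_segment else v_segment + "*01"


def is_clean_cdr3(cdr3: str) -> bool:
    """Reject sequences with stop codons (*) or frameshifts (_)."""
    return "*" not in cdr3 and "_" not in cdr3


def prep_rows(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return cleaned copies of the rows that pass all checks."""
    kept = []
    for row in rows:
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"missing required columns: {missing}")
        row = dict(row)  # copy so the input is left untouched
        row["va"] = add_default_allele(row["va"])
        row["vb"] = add_default_allele(row["vb"])
        if not (is_clean_cdr3(row["cdr3a"]) and is_clean_cdr3(row["cdr3b"])):
            continue  # drop: stop codon or frameshift
        if row["va"] not in PERMITTED_V or row["vb"] not in PERMITTED_V:
            continue  # drop: V-segment not in the permitted list
        kept.append(row)
    return kept


rows = [
    {"va": "TRAV13-1*03", "cdr3a": "CAASG", "vb": "TRBV19", "cdr3b": "CASSIRSSYEQYF"},
    {"va": "TRAV1-2", "cdr3a": "CAV*DS", "vb": "TRBV19*01", "cdr3b": "CASSF"},
    {"va": "TRAV1-2", "cdr3a": "CAVR", "vb": "TRBV19*01", "cdr3b": "CAS_SF"},
    {"va": "TRAV99-9", "cdr3a": "CAVR", "vb": "TRBV19*01", "cdr3b": "CASSF"},
]
kept = prep_rows(rows)
print(len(kept), kept[0]["vb"])  # 1 TRBV19*01
```

Only the first row survives: the second contains a stop codon, the third a frameshift, and the fourth a V-segment that is absent from the stand-in permitted list.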

TCRdist example

The following example computes TCRdist values for all pairs of annotated TCRs in the VDJdb database. It outputs all values with TCRdist <= 90.

Load the package

library(TIRTLtools)
## Error in system("nvidia-smi", intern = TRUE, ignore.stderr = TRUE) : 
##   error in running command

Load an example dataset (annotated TCRs from VDJdb)

tcr1 = TIRTLtools::vdj_db
### note: you may replace this with your own data frame (e.g. read from a file); it needs the columns "va", "vb", "cdr3a", and "cdr3b"

paged_table(tcr1)

Run TCRdist

result = TCRdist(tcr1 = tcr1, tcrdist_cutoff = 90, chunk_size = 5000)
## Checking for available GPU...
## 
## No supported GPU detected (NVIDIA or Apple Silicon).
## Checking for GPU-related Python modules...
## 
## Neither 'cupy' or 'mlx' are installed
## Loading numpy to perform TCRdist
## Number of chunks: 15
## 20% done
## Time taken so far: 10.937966 seconds
## 27% done
## Time taken so far: 16.210805 seconds
## 40% done
## Time taken so far: 26.695573 seconds
## 47% done
## Time taken so far: 31.950689 seconds
## 60% done
## Time taken so far: 42.461766 seconds
## 67% done
## Time taken so far: 47.725553 seconds
## 80% done
## Time taken so far: 58.247288 seconds
## 87% done
## Time taken so far: 63.502493 seconds
## 100% done
## Time taken so far: 74.020472 seconds
## Total time taken: 86.015819 seconds

Inspect the output

edge_df = result[['TCRdist_df']] ### table of TCRdist values <= cutoff
meta_df = result[['tcr1']] ### table of input data with indices

paged_table(edge_df)
paged_table(meta_df)